1. Introduction


This document provides guidance in writing effective software for the Cyrix 6x86™ and 
6x86MX™ processors. Differences between 6x86 and 6x86MX CPUs are listed in 
Appendix A. The Cyrix Software Developer web site (www.cyrix.com) provides current 
information and code examples on topics such as CPU Detection and Cache Line Locking. 


1.1. No Instruction-Pair Optimization Needed 


Unlike the Pentium® CPU, the Cyrix 6x86 and 6x86MX require no instruction-pair
optimization. The reason is that the execution pipelines in the 6x86/6x86MX are
more balanced than those in the Pentium. Both legacy 16-bit code and
Pentium-optimized 32-bit code pass through the Cyrix execution pipelines with the
same efficiency.


2. 6x86 Family Coding Suggestions 


The suggestions in this section will help the programmer produce higher
performance software for both the 6x86 and the 6x86MX processors.


2.1. Remove Address Generation Interlocks 
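An Address Generation Interlock (AGI) occurs when an instruction needs a register
for address calculation before an earlier instruction has finished updating it.
Separate such instructions by 2 cycles (2 to 4 instructions).
For example, avoid: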


add ecx, ebx
add edx, [ecx]


In this example, ecx in the second instruction cannot be used for address
generation until the first instruction has finished updating it. This coding
produces a two-clock bubble, which results in a four-instruction penalty.


Another common example to avoid: 


mov eax, [eax]
mov eax, [eax]
mov eax, [eax]
cmp eax, whatever


2.2. Don’t Move Variables Into Registers For Speed 


It has been a common practice to place frequently used variables into the CPU
general purpose registers (EAX, EBX, ECX, EDX). This was done because register
access was far quicker than cache access. This practice is not needed for 6x86
architecture CPUs, because the 6x86 L1 cache access time is the same as the
register access time. Leaving such variables in memory also reduces “register
pressure” limitations in the compiler.


2.3. Avoiding RISC-like Instruction Coding, Complex is Better


For CPUs that have pairing constraints, optimization of code consists partly
of turning CISC instructions into RISC-like instruction equivalents. The RISC
equivalents increase code size and therefore take more space in the cache.


The 6x86 and 6x86MX are designed to accelerate complex x86 instructions and do
not have pair optimization constraints. For this reason, it is recommended not to
break complex instructions into RISC equivalents for the 6x86 family of CPUs.


The instruction below takes only one clock cycle to execute. 


add [mem], eax 


If this instruction is broken into three RISC-like instructions, three clock cycles
are required:


mov ebx, [mem]
add ebx, eax
mov [mem], ebx


As another example, a loop instruction takes only one clock cycle:


loop foo


Using the following RISC-like coding takes two clock cycles:


dec ecx
jne foo


2.4. RAW Dependencies 
Some Read-After-Write (RAW) dependencies can cause a stall. 
For example, avoid:


add [mem], bx
add cx, [mem]


The second add instruction stalls because the memory access [mem] must wait for the 
first instruction to complete updating. 


2.5. Don’t Make Calls Without a Matching RET


A CALL that is not paired with a RET, or a RET that is reached without a CALL as
in the example below, results in branch mispredictions. These hinder speculative
execution.


For example, avoid:


push offset Main_PGM
jmp SubRoutine_PGM


Main_PGM proc

Main_PGM endp

SubRoutine_PGM proc

ret

SubRoutine_PGM endp


Instead, call the subroutine:


call SubRoutine_PGM


rather than:


push offset Main_PGM
jmp SubRoutine_PGM


2.6. Don’t Optimize for the FPU Like a Pentium 


The 6x86 FPU is not pipelined. It can only execute one FPU instruction at a time.
This is typically an issue when hand-coding FPU instructions in assembly language.
Most compilers do not make highly optimized FPU code. Use 486 FPU optimization
compiler switches.


The FXCH instruction is not paired with other FPU instructions on the 6x86. The
FXCH instruction takes three clocks on the 6x86 and two clocks on the 6x86MX.


2.7. Mix Integer and FPU Instructions 


The 6x86 CPU will execute integer instructions and FPU instructions at the same 
time. 


2.8. Mixing 16 And 32 Bit Code 


There is no penalty for mixing 16-bit and 32-bit code, because there is no penalty
for one prefix.


2.8.1 Prefix Issues 
There is a one-clock penalty for two to six prefixes, with the following caveats.


A one-clock penalty also occurs if an instruction has:


• An address-size override prefix with a displacement in the address.
• An operand-size override prefix and an immediate operand.
• A decode length greater than the valid length of the instruction queue.


Only the first instruction is decoded when:


• The second instruction has more than one prefix.
• The length of the first instruction is more than six bytes.
• The first instruction is six bytes long and the second has a prefix.
• The length of the first and second instructions together is greater than the
valid instruction queue length.
• The last byte of instruction one is 0F and the second has no prefix.
• The first instruction has taken a predicted change of flow.

2.9. Branch and FPU Optimization 


The CPU will speculatively execute up to four FPU or jump (JMP or Jcc)
instructions at any one time. The fifth FPU instruction or jump will cause a stall
until one of the earlier jump or FPU instructions reaches completion.


2.9.1 No Penalty When Using Partial Registers 


The 6x86 core has no problem with mixing code that uses 8-, 16-, or 32-bit parts of
the same register on successive instructions.


For example:


mov bh, [mem1]
add ebx, [mem2]


does not cause a stall condition due to using register BH and then EBX.


2.9.2 Write Gathering For Video Memory or Other Memory Mapped I/O


The 6x86 and 6x86MX processors can be programmed on a region by region basis
to optimize for different memory types. The Data Books and BIOS Writer's Guides
detail all the available options.


Application Note 103, 6x86MX BIOS Writer's Guide, defines Region 7 for memory
above physical memory. Typically this is where the frame buffer resides. Enabling
write combining for Region 7 is suggested. If Region 7 has write combining
enabled, accesses to video memory in this region will be optimized.


2.10. Self-Modifying Code Should Be Avoided 


Self-modifying code can have a significant negative performance impact because
the CPU must flush state information to keep caches and internal information
coherent once it has detected that code has been modified. Most of the time,
self-modifying code will be slower than other programming techniques.


Self-modifying code is detected when a write occurs to a location that is held in
the 256-byte instruction line buffer (ILB). The ILB can be thought of as a prefetch
buffer. Self-modifying code can also be detected when data and code lie on the
same cache line and the data is changed.


A CS: override in assembly language code is an indication that self-modifying
code is being used.


2.11. Exclusive Instructions Use Both Pipes 


Exclusive instructions execute in the X pipe only but also use resources in the Y
pipe, so no other instruction is paired with them. This has a negative performance
impact. The instructions that follow enter the pipes once the exclusive instruction
leaves the AC1 stage.


The instructions are:


AAM, ARPL, BOUND, CALL, CLI, CLTS, DIV, ENTER, HLT, IDIV, IMUL, IN,
INS, INT, INTO, INT1, INT3, INVD, INVLPG, IRET, Jump Indirect, Jump Interseg,
LAR, LEAVE, LGDT, LXS, LLDT, LMSW, LSL, LTR, MOVseg/sr, MOVS,
MUL, OUT, OUTS, POPA, POPF, POP es/ss/ds, POP fs/gs, PUSHA, PUSHF, RET,
SGDT, SIDT, SLDT, SMSW, STI, CLI, STD, CLD, STR, VERR, VERW, WAIT,
WBINVD, LDS, LES, LFS, LGS, LSS, JCXZ, LOOP, CMPXCHG, BSWAP, CMPS,
SCAS, XLAT.


Most of these instructions are typically used by operating system code. 


2.12. Some Instructions Only Go Down the X Pipe 
The following instructions always go down the X pipe: 
Jcc, JMP, CALL, SETcc, and FPU instructions.


Other instructions are paired and sent down the Y pipe. 


2.13. Unified Cache Architecture Issues 


The Cyrix 6x86/6x86MX CPUs use a unified cache design. This means there is
only one primary or L1 cache for both data and code. This typically provides a
higher hit rate than the Harvard (split data and code) cache design used in the
Pentium. However, large data operations can fill the entire cache with data and
cause code misses. This issue can be addressed by writing critical code to fit in
the instruction line buffer.


On the 6x86MX, this issue can also be addressed by locking down critical code in
the cache. See section 4.8, “Cache Line Locking to Aid Real Time Software”, for
more information on Cache Line Locking on the 6x86MX.


The unified cache is effectively dual ported. It is possible to access two data items 
in the same clock if there is no bank conflict. 


Two simultaneous misaligned word accesses will result in an extra clock delay in 
the AC2 stage. 


A three-clock penalty results from a write followed by a read to the same memory
location.


2.14. Code Branch Alignment 


Align branch targets to eight byte boundaries. The CPU will fetch 16 bytes on 8 
byte boundaries. 


2.15. Data Alignment 


Don't span a data item across eight-byte boundaries. 
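The data alignment rule can be checked mechanically. The following C sketch is only an illustration: the function name `spans_8byte_boundary` is hypothetical and is not part of any Cyrix tool. It reports whether an item of a given size placed at a given address would straddle an eight-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if an item of `size` bytes starting at
 * address `addr` crosses an eight-byte boundary, 0 otherwise.
 * Any item larger than eight bytes necessarily crosses one. */
int spans_8byte_boundary(uintptr_t addr, size_t size)
{
    assert(size > 0);
    /* The first and last byte must fall in the same 8-byte block. */
    return (addr / 8) != ((addr + size - 1) / 8);
}
```

A dword at an address ending in 4 stays inside one block, while the same dword at an address ending in 6 does not, so the second placement is the one to avoid.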


2.16. Branch Mispredictions Should Be Avoided


Extra clock-cycle delays occur when the CPU's branch prediction predicts
incorrectly. These delays typically result from branch repairs that require using
additional resources. With register renaming, branch repair has a penalty of one
clock.


Code should be designed to flow top to bottom with fewer loops. The trade off is 
code size. 


Loops are initially predicted taken. 


2.17. SMM 


The Cyrix SMM implementation differs from that of other vendors in what is
saved and restored automatically on entry to and exit from an SMI interrupt. The
Cyrix design minimizes the overhead for entry and exit of the handler. This
minimal CPU state save and restore results in fast entry and exit of the SMI
handler. This implementation permits not only faster power management decisions
but also allows for virtualization of peripherals (e.g., MediaGX).


The Cyrix SMM implementation is fully software configurable. This permits
TSRs or device drivers to act as SMM handlers.


For a detailed discussion refer to Application Note 107 6x86MX SMM Design 
Guide. 


3. 6x86 Unique Features 


3.1. CPUID and Returned Feature Bits 


The 6x86 and 6x86L implement the CPUID instruction. However, for compatibility
with previous CPUs, the CPUID instruction is not enabled by the BIOS. See
Application Note 112, Cyrix CPU Detection Guide, on the Cyrix Developers web
page for detailed information. The home web page is www.cyrix.com.


The CPUID instruction execution, with EAX=1, loads the feature bits into the EDX 
register. For the early 6x86 devices, the only feature bit that is enabled is bit 0 for 
the FPU. 


The 6x86L supports an FPU, Debug Extensions (also known as I/O breakpoints),
and the Compare Exchange 8 Byte instruction. The 6x86L can be identified by
DIR0 values 2h. The 6x86L also adds support for the instruction MOV to and from
CR4.


3.2. Cache Organization 


The 6x86 contains a dual ported 16K unified cache with 512 lines and 32 bytes per 
line. 


3.3. TLB Organization 


The TLB on the 6x86 is a 128-entry direct-mapped TLB, backed up by an 8-entry 
fully associative victim TLB. Both TLBs are dual ported. This allows both integer 
pipes to access TLB entries simultaneously. The value of 136 should be used in 
algorithms that make decisions about flushing individual entries with the INVLPG 
instruction or all the entries with a MOV to CR3 instruction. 
4. 6x86MX Unique Features


4.1. CPUID Bits


The 6x86MX is initialized with CPUID enabled. CPUID will return the following
information for the 6x86MX CPU:


Vendor ID String                        “CyrixInstead”
CPUID Levels Supported                  1
Family                                  6
Model                                   0
Stepping                                TBD
Feature Flag Value                      0x0080A135


FEATURE FLAGS SET

FEATURE FLAG SET                        FEATURE
FPU Present                             6x86MX contains an enhanced FPU.
I/O Breakpoints                         I/O cycles can be trapped, controllable by CR4:DE.
Time Stamp Counter Supported            RDTSC instruction is supported, controllable by CR4:TSD.
RDMSR and WRMSR Instructions Present    Model Specific Registers are supported.
CMPXCHG8B Instruction Supported         Compare exchange eight byte instruction supported.
PTE Global Bit Support                  PTE TLB will not be flushed when CR3 is written.
CMOV and FCMOV Instructions Supported   Conditional move instructions supported.
MMX™ Instructions Supported             Multi Media Instruction Extensions supported.
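The feature flag value can be checked against the listed features bit by bit. The C sketch below is an illustration only: it does not execute CPUID, and the macro and function names are hypothetical. The bit positions are the standard x86 positions in EDX for CPUID function 1, which this section does not spell out.

```c
#include <stdint.h>

/* Standard CPUID function 1 EDX bit positions for the listed features. */
#define FEAT_FPU   (1u << 0)   /* FPU present */
#define FEAT_DE    (1u << 2)   /* I/O breakpoints (debug extensions) */
#define FEAT_TSC   (1u << 4)   /* time stamp counter */
#define FEAT_MSR   (1u << 5)   /* RDMSR and WRMSR */
#define FEAT_CX8   (1u << 8)   /* CMPXCHG8B */
#define FEAT_PGE   (1u << 13)  /* PTE global bit */
#define FEAT_CMOV  (1u << 15)  /* CMOV and FCMOV */
#define FEAT_MMX   (1u << 23)  /* MMX instructions */

/* Returns nonzero if every bit in `mask` is set in the EDX feature value. */
int has_features(uint32_t edx, uint32_t mask)
{
    return (edx & mask) == mask;
}

/* OR of the eight listed flags; this reproduces the documented value. */
uint32_t mx_listed_flags(void)
{
    return FEAT_FPU | FEAT_DE | FEAT_TSC | FEAT_MSR |
           FEAT_CX8 | FEAT_PGE | FEAT_CMOV | FEAT_MMX;
}
```

Combining the eight flags gives 0x0080A135, the Feature Flag Value stated for the 6x86MX, so software can test individual bits rather than compare the whole value.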


4.2. MMX™ Instructions and Optimizations 


All instructions execute in one clock except: Packed Multiply, Packed
Multiply-and-Add, and Dword Mov from an MMX™ register to an x86 core register. A
new MMX™ instruction can be issued during each clock.


4.2.1 Fast FPU/MMX™ Switching 


Certain processors, such as the Pentium® with MMX™, take as many as 50 or 60
machine cycles to switch between FPU and MMX™ instruction execution. The
6x86MX does not have this limitation. It executes the EMMS instruction in one
clock.


4.2.2 Extended MMX™ Instructions 


Cyrix has added instructions to its implementation of the Intel MMX™
Architecture in order to facilitate writing of multimedia applications. All of the
added instructions follow the SIMD (single instruction, multiple data) format. Many
of the instructions add flexibility to the MMX™ architecture by allowing both source
operands of an instruction to be preserved, while the result goes to a separate
register that is derived from the input registers.


4.2.3 Detecting Extended MMX™ Instructions 

1) Check for Family 6 Model 0: Extended MMX™ is supported.

2) Future CPUs can be checked by the Extended Feature Flag:
CPUID Extended Flag[24] set means Extended MMX™ is supported.


See the Cyrix Developer web site for the current extended CPUID description. 


Cyrix Extended Instructions To MMX™ Instruction Set


PADDSIW  Packed Add Signed Word with Saturation Using Implied Destination
Opcode:    0F51 [11 mm1 mm2]  MMX Register plus MMX Register to Implied Register
           0F51 [mod mm r/m]  Memory plus MMX Register to Implied Register
Operation: Sum signed packed word from MMX register/memory to signed packed
           word in MMX register, saturate, and write result to implied register.


PAVEB  Packed Average Byte
Opcode:    0F50 [11 mm1 mm2]  MMX Register 2 with MMX Register 1
           0F50 [mod mm r/m]  Memory with MMX Register
Operation: Average packed byte from the MMX register/memory with packed byte
           in the MMX register. Result is placed in the MMX register.


PDISTIB  Packed Distance and Accumulate with Implied Register
Opcode:    0F54 [mod mm r/m]  Memory, MMX Register to Implied Register
Operation: Find absolute value of difference between packed byte in memory and
           packed byte in the MMX register. Using unsigned saturation,
           accumulate with value in implied destination register.


PMACHRIW  Packed Multiply and Accumulate with Rounding
Opcode:    0F5E [mod mm r/m]  Memory to MMX Register
Operation: Multiply the packed word in the MMX register by the packed word in
           memory. Sum the 32-bit results pairwise. Accumulate the result with
           the packed signed word in the implied destination register.


PMAGW  Packed Magnitude
Opcode:    0F52 [11 mm1 mm2]  MMX Register 2 to MMX Register 1
           0F52 [mod mm r/m]  Memory to MMX Register
Operation: Set the destination equal to the packed word with the largest
           magnitude, between the packed word in the MMX register/memory and
           the MMX register.


PMULHRIW  Packed Multiply High with Rounding, Implied Destination
Opcode:    0F5D [11 mm1 mm2]  MMX Register 2 to MMX Register 1
           0F5D [mod mm r/m]  Memory to MMX Register
Operation: Packed multiply high with rounding and store bits 30 - 15 in implied
           register.


PMULHRW  Packed Multiply High with Rounding
Opcode:    0F59 [11 mm1 mm2]  MMX Register 2 to MMX Register 1
           0F59 [mod mm r/m]  Memory to MMX Register
Operation: Multiply the signed packed word in the MMX register/memory with the
           signed packed word in the MMX register. Round with 1/2 bit 15, and
           store bits 30 - 15 of result in the MMX register.


PMVGEZB  Packed Conditional Move If Greater Than or Equal to Zero
Opcode:    0F5C [mod mm r/m]  Memory to MMX Register
Operation: Conditionally move packed byte from memory to packed byte in the
           MMX register if packed byte in implied MMX register is greater than
           or equal to zero.


PMVLZB  Packed Conditional Move If Less Than Zero
Opcode:    0F5B [mod mm r/m]  Memory to MMX Register
Operation: Conditionally move packed byte from memory to packed byte in the
           MMX register if packed byte in implied MMX register is less than zero.


PMVNZB  Packed Conditional Move If Not Zero
Opcode:    0F5A [mod mm r/m]  Memory to MMX Register
Operation: Conditionally move packed byte from memory to packed byte in the
           MMX register if packed byte in implied MMX register is not zero.


PMVZB  Packed Conditional Move If Zero
Opcode:    0F58 [mod mm r/m]  Memory to MMX Register
Operation: Conditionally move packed byte from memory to packed byte in the
           MMX register if packed byte in implied MMX register is zero.


PSUBSIW  Packed Subtracted with Saturation Using Implied Destination
Opcode:    0F55 [11 mm1 mm2]  MMX Register 2 to MMX Register 1
           0F55 [mod mm r/m]  Memory to MMX Register
Operation: Subtract signed packed word in the MMX register/memory from signed
           packed word in the MMX register, saturate, and write result to
           implied register.


4.2.4 Implied Registers 


Implied registers provide a third register for Cyrix Extended MMX instructions. 
The implied register is used as a destination register for results, so that the source 
register’s contents are not overwritten. 


For example, the IDCT (Inverse Discrete Cosine Transform) algorithm, used in
MPEG video decode, has several places where two vector inputs are used in two
separate calculations. In one calculation, the two vectors may be added, and in the
second, one of the vectors is subtracted from the other. In order to accomplish this
algorithm using the basic MMX instructions, one of the vectors must be copied in
order to preserve its original value before the first computation. This is because the
MMX instructions all destroy the contents of one of the source registers by using
the same register as the destination.


Several of the Cyrix-added MMX instructions get around this problem by having an
implied destination register, which is derived from the first source register. This
way, the contents of both source vectors are preserved without having to make a
copy of either one. A few of the instructions use an implied register as another
source, so that the first register in the instruction is still the destination.


The implied register is calculated from the first source, according to the following 
table: 


IMPLIED REGISTER PAIRS

FIRST SOURCE REGISTER    IMPLIED REGISTER
mm0                      mm1
mm1                      mm0
mm2                      mm3
mm3                      mm2
mm4                      mm5
mm5                      mm4
mm6                      mm7
mm7                      mm6


As the table shows, the source and destination registers are in pairs, where
the pairs are determined by inverting the least significant bit of the binary
representation of the register number.
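The pairing rule amounts to a single exclusive-OR on the register number. The following C sketch models it; the function name `implied_mmx_reg` is hypothetical and is used here only for illustration.

```c
#include <assert.h>

/* Returns the number (0-7) of the implied register for the first source
 * register mmN, following the implied register pairs table. */
unsigned implied_mmx_reg(unsigned first_src)
{
    assert(first_src < 8);
    return first_src ^ 1u;   /* invert the least significant bit */
}
```

Because the operation is its own inverse, applying it twice returns the original register number, which is why the table lists each pair in both directions.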


4.2.5 Implied Instruction Examples 


The PADDSIW instruction performs the same function as the basic MMX
PADDSW instruction, except that it preserves the contents of both input vectors. If
one of the vectors of interest is in register mm1 and the other is in register mm2, the
instruction would look like this:


PADDSIW mm1, mm2 ;result in mm0


and the result would end up in register mm0. The instruction could also
be written as:


PADDSIW mm2, mm1 ;result in mm3


and the result would end up in register mm3. In this particular instruction, the
second input can also be a memory operand, but the implied register stays the same, so


PADDSIW mm1, [si] ;result in mm0


puts its result in register mm0.


Caution is required when programming with these instructions in order for them to
have the desired effect. For example,


PADDSIW mm1, mm0 ;result in mm0


will put its result in register mm0, thus losing the original input value. The
instruction written this way is exactly equivalent to


PADDSW mm0, mm1


A few of the instructions that use an implied register still use the first register in the
instruction as the destination. These instructions are the packed conditional move
commands PMVZB, PMVNZB, PMVLZB, and PMVGEZB. Note that the mnemonics
for these instructions do not have the “I” for “implied destination” in them,
so there should be no ambiguity about where the result goes. In the case of the
packed conditional move instructions, the packed values from the source are moved
as packed values to the destination register, depending upon the packed values in
the implied register. They are three-input instructions.


These instructions are beneficial for numerous algorithms, for example: 


Implied destination instructions: PADDSIW, PSUBSIW, PMULHRIW


These are used for preserving the first source operand. Using them properly can
overcome the shortcomings of two-operand MMX instructions.


Fixed-point mode higher accuracy multiply with rounding: PMULHRW,
PMACHRIW, and PMULHRIW provide higher accuracy multiplication. The
result is in 1.15 format instead of the Intel 2.14 format. This is important in digital
filters and in video/audio/speech coding where improved accuracy is needed.
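The rounding multiply can be modelled for a single 16-bit lane. The C sketch below follows the PMULHRW description in the instruction table (round with half of bit 15, keep bits 30 - 15). It is an illustrative model under that reading, not a verified emulation of the hardware, and the function name `mulhr_q15_lane` is hypothetical.

```c
#include <stdint.h>

/* One 16-bit lane of multiply-high-with-rounding: form the full 32-bit
 * product, add half of bit 15 (0x4000), and keep bits 30..15.
 * Unsigned arithmetic avoids shifting a negative value. The final
 * conversion to int16_t assumes the usual two's-complement wraparound. */
int16_t mulhr_q15_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    uint32_t rounded = (uint32_t)product + 0x4000u;
    return (int16_t)(uint16_t)(rounded >> 15);
}
```

With inputs in 1.15 format, 0x4000 represents 0.5, and multiplying 0.5 by 0.5 yields 0x2000, which is 0.25 in the same 1.15 format. A plain high multiply that keeps bits 31 - 16 would instead return the 2.14 result mentioned above.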


4.2.6 Examples of Extended MMX Instruction Applications


Average Instruction: PAVEB 


PAVEB averages two MMX registers in byte partitions. This is useful in algorithms 
such as motion compensation. 


Magnitude Instruction: PMAGW 


This is useful in signal scaling by finding the largest magnitude among a number of 
samples. 


Distance instruction: PDISTIB


PDISTIB accumulates into an implied destination: dst = SUM |a - b|. The
instruction works on a byte partition. This is appropriate for motion estimation,
which is part of many video compression algorithms.
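The distance-and-accumulate step can be sketched for one byte lane. The C function below follows the table description (absolute difference, then accumulate with unsigned saturation). The name `dist_accumulate_lane` is hypothetical, and the code is a model for illustration rather than a cycle-accurate emulation.

```c
#include <stdint.h>

/* One byte lane: acc + |a - b|, saturated at 255 (unsigned saturation). */
uint8_t dist_accumulate_lane(uint8_t acc, uint8_t a, uint8_t b)
{
    unsigned diff = (a > b) ? (unsigned)(a - b) : (unsigned)(b - a);
    unsigned sum = acc + diff;
    return (uint8_t)(sum > 255u ? 255u : sum);
}
```

In a motion estimation loop, this lane operation would be applied to each byte of the reference and candidate blocks, with the accumulator carrying the running error.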


Conditional Move Instructions: 
PMVGEZB, PMVLZB, PMVNZB, PMVZB 


This group of instructions uses byte partition. These instructions can be used for 
image manipulation, such as the chroma keying algorithm. 
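A byte-wise conditional move of this kind can be modelled in C as follows. The sketch mirrors the PMVZB description (move the source byte where the implied register byte is zero). Treating a zero mask byte as "this pixel matched the key color" is an assumption made for illustration, and both function names are hypothetical.

```c
#include <stdint.h>

/* One byte lane of a move-if-zero: the destination byte takes the source
 * byte only where the implied (mask) byte is zero. */
uint8_t move_if_zero_lane(uint8_t dst, uint8_t src, uint8_t implied)
{
    return (implied == 0) ? src : dst;
}

/* Applies the lane operation across all eight bytes of a 64-bit operand. */
void move_if_zero_8(uint8_t dst[8], const uint8_t src[8],
                    const uint8_t implied[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = move_if_zero_lane(dst[i], src[i], implied[i]);
}
```

For chroma keying, `dst` would hold foreground pixels, `src` the background, and `implied` a mask computed earlier, so background bytes replace the foreground only where the mask is zero.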


4.2.7 Enabling Extended MMX Instructions 


The Extended MMX instructions are disabled by default. These instructions are
enabled by setting Bit 0 of CCR7 (index 0xeb) to 1. Doing this requires ring 0
access, which for some operating systems will mean a ring 0 device driver.


4.3. SMM Enhancements 


SMM has been enhanced to support reentrancy or nesting of SMIs. This will
facilitate servicing an SMM interrupt within an SMM context. Code and data are
cached on the 6x86MX to enhance performance.


See Application Note 107 6x86MX SMM Design Guide for details. 


4.4. Cache Organization 


The 6x86MX contains a dual ported 64K write-back unified cache. The cache is 
organized as a 4-way set associative cache with 2048 lines and 32 bytes per line. 
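The stated cache geometry is internally consistent, which the short C sketch below checks. The line count and lines-per-way figures follow directly from the numbers above. The address-to-line mapping in `mx_line_index` is the conventional one for a set associative cache and is an assumption, since this section does not describe it. All names are hypothetical.

```c
#include <stdint.h>

#define MX_CACHE_BYTES 65536u  /* 64K unified cache */
#define MX_LINE_BYTES  32u     /* bytes per line */
#define MX_WAYS        4u      /* 4-way set associative */

/* Total number of lines in the cache. */
unsigned mx_cache_lines(void)
{
    return MX_CACHE_BYTES / MX_LINE_BYTES;
}

/* Number of lines in each of the four ways. */
unsigned mx_lines_per_way(void)
{
    return mx_cache_lines() / MX_WAYS;
}

/* Assumed conventional mapping: address bits above the 32-byte line
 * offset select one of the lines within a way. */
unsigned mx_line_index(uint32_t phys_addr)
{
    return (phys_addr / MX_LINE_BYTES) % mx_lines_per_way();
}
```

Under that assumed mapping, addresses that differ by a multiple of 16K bytes compete for the same line position, so at most four such addresses can be cached at once.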


4.5. TLB Organization 


The 6x86MX TLB is organized into two levels. The first level is a 16-entry
direct-mapped TLB. The second level is a 384-entry, 6-way set associative TLB. The
value 400 should be used in algorithms that make decisions about flushing
individual entries with the INVLPG instruction or all the entries with a MOV to
CR3 instruction.
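One way such a decision might look in software is sketched below in C. The TLB sizes come from this document (136 for the 6x86, 400 for the 6x86MX), but the policy itself, invalidating pages individually only when fewer pages are affected than the TLB can hold, is an assumption for illustration. The type and function names are hypothetical.

```c
#include <stddef.h>

#define TLB_ENTRIES_6X86    136u  /* 128-entry main TLB + 8-entry victim TLB */
#define TLB_ENTRIES_6X86MX  400u  /* 16-entry level 1 + 384-entry level 2 */

typedef enum {
    FLUSH_INVLPG_EACH,  /* issue INVLPG for every affected page */
    FLUSH_RELOAD_CR3    /* flush all entries with a MOV to CR3 */
} tlb_flush_method;

/* Assumed policy: once the number of pages reaches the TLB capacity,
 * individual invalidation cannot save any entries, so reload CR3. */
tlb_flush_method choose_tlb_flush(size_t pages_to_invalidate,
                                  size_t tlb_entries)
{
    return (pages_to_invalidate < tlb_entries) ? FLUSH_INVLPG_EACH
                                               : FLUSH_RELOAD_CR3;
}
```

A real operating system would likely use a lower crossover point that accounts for the cost of each INVLPG; the constants above only bound the decision.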


4.6. Performance Monitors 


See the 6x86MX Data book for a detailed listing and discussion of performance 
monitors. 


Note that the 6x86MX contains performance monitors that are not provided by
other CPU manufacturers. When applicable, the 6x86MX performance monitors
overlap with those of the Intel® Pentium® P55C CPU. If performance monitors are
used, it is recommended that their usage be conditioned on the CPU vendor and the
CPU Family and Model.


Due to the architectural differences between the Cyrix 6x86MX and the Intel P55C 
the counters may have different meanings with relation to TLB architecture, Cache 
architecture, and pipe naming. 


The following Event Counter Types are different between the 6x86MX and Intel
P55C:


EVENT COUNTER TYPE DIFFERENCES

NUMBER (HEX)    6x86MX DESCRIPTION
07              External Inquiries
08              External Inquiries that hit
0D              L2 TLB Code Misses
10              Reserved
11              Reserved
17              Instructions Executed in the Y pipe
20              Reserved
21              Reserved
2A              Reserved
2C              Reserved
2E              Reserved
30              Reserved
31              MMX Instruction Data Reads
32              Reserved
35              Reserved
36              Reserved
40              L2 TLB Misses (Code or Data)
41              L1 TLB Data Miss
42              L1 TLB Code Miss
43              L1 TLB Miss (Code or Data)
44              TLB Flushes
45              TLB Page Invalidates
46              TLB Page Invalidates that hit
48              Instructions Decoded


4.7. Determining Clock Multiplier 


The clock multiplier of the 6x86MX can be determined by reading the bottom 4 bits
of DIR0. The register index for DIR0 is FEh.


DIR0[3:0]    CLOCK MULTIPLIER
0            1x
1
2
3            3x
4

4.8. Cache Line Locking to Aid Real Time Software 


The 6x86MX adds the unique ability to lock down lines in the primary or L1 cache
of the CPU. This is valuable for Real Time applications. Locking down code or
data in the L1 cache guarantees that the information is kept in the L1 cache until
software unlocks the area and the L1 cache LRU replaces the data. The only
negative impact is the loss of the cache lines for normal cache operation.


Items locked down stay coherent only with that CPU, but are not guaranteed
coherent with main memory.


4.9. Cache Line Locking Operations 


The Cache Line locking feature is controlled by using MSR3, MSR4, and MSR5. When a line is unlocked, the 
line is marked invalid to avoid write backs to non-existent memory. 


Register layout:
MSR5: bit 23 = SMI, bits 19-16 = V MESI, bits 11-8 = MRU, bits 5-4 = SET, bits 1-0 = CTL
MSR4: bits 31-2 = ADDRESS
MSR3: bits 31-0 = DATA


CACHE TEST REGISTER BIT DEFINITIONS

REGISTER    FIELD      RANGE    DESCRIPTION
MSR5        SMI        23       SMI Address Bit. Selects separate/cacheable SMI code/data space.
            V MESI     19-16    Valid, MESI Bits*
                                1000 = Modified
                                1001 = Shared
                                1010 = Exclusive
                                0011 = Invalid
                                1100 = Locked Valid
                                0111 = Locked Invalid
                                Else = Undefined
            MRU        11-8     Used to determine the Least Recently Used (LRU) line.
            SET        5-4      Cache Set. Selects one of four cache sets to perform operation on.
            CTL        1-0      Control field
                                00 = flush cache without invalidate
                                01 = write cache
                                10 = read cache
                                11 = no cache or test register modification
MSR4        ADDRESS    31-2     Physical Address
MSR3        DATA       31-0     Data written or read during a cache test.

*Note: All 32 bytes should contain valid data before a line is marked as valid.
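The MSR5 field positions can be captured in a small helper that assembles a register value. The C sketch below only computes the 32-bit value from the bit definitions above. It does not execute WRMSR, which requires ring 0, and it does not describe the full locking sequence. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* V MESI encodings (bits 19-16) from the bit definitions table. */
enum {
    VMESI_MODIFIED       = 0x8,  /* 1000 */
    VMESI_SHARED         = 0x9,  /* 1001 */
    VMESI_EXCLUSIVE      = 0xA,  /* 1010 */
    VMESI_INVALID        = 0x3,  /* 0011 */
    VMESI_LOCKED_VALID   = 0xC,  /* 1100 */
    VMESI_LOCKED_INVALID = 0x7   /* 0111 */
};

/* CTL encodings (bits 1-0). */
enum { CTL_FLUSH = 0, CTL_WRITE = 1, CTL_READ = 2, CTL_NONE = 3 };

/* Packs the documented fields into an MSR5 value. */
uint32_t msr5_compose(unsigned smi, unsigned vmesi, unsigned mru,
                      unsigned set, unsigned ctl)
{
    assert(smi <= 1 && vmesi <= 0xF && mru <= 0xF && set <= 3 && ctl <= 3);
    return ((uint32_t)smi << 23) | ((uint32_t)vmesi << 16) |
           ((uint32_t)mru << 8) | ((uint32_t)set << 4) | (uint32_t)ctl;
}

/* Extracts the V MESI field from an MSR5 value read back from the CPU. */
unsigned msr5_vmesi(uint32_t msr5)
{
    return (msr5 >> 16) & 0xFu;
}
```

For instance, `msr5_compose(0, VMESI_LOCKED_VALID, 0, 2, CTL_WRITE)` describes a cache write that marks a line in set 2 as Locked Valid, and `msr5_vmesi` can be used on a value read back to check whether a line is already locked.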
MSR5 CACHE CONTROL OPERATIONS:

ACTION        ECX    EDX                      EAX                      OPERATION
Read/Write    03h    ---                      Cache Data               Data to/from MSR3
Write         04h    Address Upper 32 Bits    Address Lower 32 Bits    Data at EDX:EAX -> MSR4
Read          04h    Address Upper 32 Bits    Address Lower 32 Bits    MSR4 -> EDX:EAX
Write         05h    ---                      Data                     Function_MSR5 (EAX)
Read          05h    ---                      Data                     Read MSR5 -> EAX


Lock information is kept in the MESI bits. 


4.10. Cache Modification Action Effects on Locked Lines


CACHE MODIFICATION ACTION EFFECTS ON LOCKED LINES

ACTION            EFFECT ON LOCK BITS
Power on Reset    Cleared
Reset             Cleared
Warm Reset        Unaffected
Flush             Unaffected
WBINVD            Unaffected
INVD              Unaffected


4.11. Cache Line Locking Guide Lines 


Cache line locking is more effective when the following suggestions are followed: 


1) Don't lock all ways or sets of the cache. Inspect the cache before locking.
Avoid locking set 3 so it will be available for normal cache operation.


2) Do not allocate an address twice in a cache block. The results will be
catastrophic. Check whether an address is already locked before locking it.


See section 4.9, “Cache Line Locking Operations”, for additional information.


4.12. Cache Line Locking Example Code 


As time permits, Cache Line Locking examples will be placed on the Cyrix 
Software Developer web page. 


Appendix A. 
Summary of Differences Between 6x86 and 6x86MX CPUs 


FEATURE                    6x86                 6x86MX
Cache                      16K Size             64K Size
                           4-way                4-way
                           512 lines            2048 lines
                           32 bytes per line    32 bytes per line
TLB                        128/8                16/384
BTB                        256                  512
Lockable Cache             no                   yes
Time Stamp Counter         no                   yes
Performance Counters       no                   yes
Global PTE                 no                   yes
CPUID enabled              no                   yes
MMX                        no                   yes
Extended MMX               no                   yes
Nestable SMI support       no                   yes
CMOV                       no                   yes
SMI code/data cacheable    no                   yes
Prefetch Queue Depth       64 bytes             64 bytes
Appendix B.
Web Page for Software Vendor Support 


For more information, help, or to contact Software Vendor Support: 


http://www.cyrix.com/developers/software/isv.htm 


Appendix C. 
6x86™ and 6x86MX™ Technical Documents. 


6x86 Data Book 

6x86 BIOS Writer's Guide 

6x86 SMM Programmer's Guide

6x86MX Data Book 

6x86MX BIOS Writer's Guide 

6x86MX SMM Programmer's Guide 

SMM Programmer's Guide. 

Application Note 101 Board Design and Bus Differences 
Application Note 102 Signal and Bus Description 
Application Note 103 BIOS Writer's Guide 

Application Note 104 Fan Voltage Regulator and Chipset Guide 
Application Note 105 Thermal Considerations 
Application Note 106 CPU Optimization 

Application Note 107 SMM Design Guide 

Application Note 108 Cyrix MMX Extension 
Application Note 112 Cyrix CPU Detection Guide 
Rev 1.7 Added Prefetch Queue to Differences Table Appendix A, Page 25
Rev 1.6 Changed TRx to MSRx 